Artificial intelligence-aided protein engineering: from topological data analysis to deep protein language models

Protein engineering is an emerging field in biotechnology with the potential to transform areas such as antibody design, drug discovery, food security, and ecology. However, the mutational space involved is too vast to be explored by experimental means alone. By leveraging accumulated protein databases, machine learning (ML) models, particularly those based on natural language processing (NLP), have considerably expedited protein engineering. Moreover, advances in topological data analysis (TDA) and artificial intelligence-based protein structure prediction, such as AlphaFold2, have made more powerful structure-based ML-assisted protein engineering strategies possible. This review aims to offer a comprehensive and systematic set of methodological components, including TDA and NLP, for protein engineering and to facilitate their future development.


Mathematical theory of topological data analysis (TDA)

Simplicial complex and chain complex
A graph is a representation of a point cloud consisting of vertices and edges that model pairwise interactions, such as atoms and bonds in molecules. A simplicial complex, the generalization of a graph, builds richer shapes that include higher-dimensional objects. A simplicial complex is composed of simplexes up to a certain dimension. A k-simplex, $\sigma^k$, is the convex hull of k + 1 affinely independent points $v_0, v_1, v_2, \ldots, v_k$:

$$\sigma^k := \left\{ \lambda_0 v_0 + \lambda_1 v_1 + \cdots + \lambda_k v_k \;\middle|\; \sum_{i=0}^{k} \lambda_i = 1;\ 0 \le \lambda_i \le 1 \ \text{for all } i \right\}. \tag{4}$$

In Euclidean space, a 0-simplex is a point, a 1-simplex is an edge, a 2-simplex is a triangle, and a 3-simplex is a tetrahedron. For k > 3, the k-simplex describes an abstract, higher-dimensional simplex.
A subset of m + 1 of the k + 1 vertices of a k-simplex $\sigma^k$ spans a convex hull in a lower dimension, which is called an m-face $\sigma^m$ of the k-simplex and denoted $\sigma^m \subset \sigma^k$. A simplicial complex K is a finite collection of simplexes satisfying two conditions:
1) Any face of a simplex in K is also in K.
2) The intersection of any two simplexes in K is either empty or a shared face.
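The two conditions can be made concrete with a minimal Python sketch. It assumes an abstract simplicial complex stored as a set of sorted vertex tuples, a representation chosen here for illustration and not taken from the source. In this combinatorial setting, condition 2 follows automatically once condition 1 holds, so only condition 1 is checked. The function names `faces`, `closure`, and `is_simplicial_complex` are hypothetical.

```python
from itertools import combinations

def faces(simplex):
    """All nonempty proper faces of a simplex given as a tuple of vertex labels."""
    verts = tuple(sorted(simplex))
    return {f for m in range(1, len(verts)) for f in combinations(verts, m)}

def closure(top_simplexes):
    """Smallest abstract simplicial complex containing the given simplexes."""
    complex_ = set()
    for s in top_simplexes:
        s = tuple(sorted(s))
        complex_.add(s)
        complex_ |= faces(s)
    return complex_

def is_simplicial_complex(simplexes):
    """Condition 1: every face of every simplex is also in the collection."""
    simplexes = {tuple(sorted(s)) for s in simplexes}
    return all(faces(s) <= simplexes for s in simplexes)

# A filled triangle (0, 1, 2) with a dangling edge (2, 3)
K = closure([(0, 1, 2), (2, 3)])
print(sorted(K, key=lambda s: (len(s), s)))
print(is_simplicial_complex(K))
```

A bare edge such as `[(0, 1)]` fails the check because its two end vertices are missing from the collection.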
The interactions between two simplexes can be described by adjacency. For example, in graph theory, two vertices (0-simplexes) are adjacent if they share a common edge (1-simplex). Adjacency for k-simplexes with k > 0 includes both upper and lower adjacency. Two distinct k-simplexes, $\sigma_1$ and $\sigma_2$, in K are upper adjacent, denoted $\sigma_1 \sim_U \sigma_2$, if both are faces of a (k + 1)-simplex in K, called a common upper simplex. Two distinct k-simplexes, $\sigma_1$ and $\sigma_2$, in K are lower adjacent, denoted $\sigma_1 \sim_L \sigma_2$, if they share a common (k − 1)-simplex as a face, called a common lower simplex. For two upper (or lower) adjacent simplexes, the common upper (or lower) simplex is unique. The upper degree of a k-simplex, $\deg_U(\sigma^k)$, is the number of (k + 1)-simplexes in K of which $\sigma^k$ is a face; the lower degree of a k-simplex, $\deg_L(\sigma^k)$, is the number of nonempty (k − 1)-simplexes in K that are faces of $\sigma^k$, which is always k + 1. The degree of a k-simplex (k > 0) is defined as the sum of its upper and lower degrees:

$$\deg(\sigma^k) = \deg_U(\sigma^k) + \deg_L(\sigma^k) = \deg_U(\sigma^k) + k + 1.$$

For k = 0, a vertex has no nonempty lower faces, so the degree of a vertex is:

$$\deg(\sigma^0) = \deg_U(\sigma^0).$$

A simplex of dimension k > 0 has an orientation determined by the ordering of its vertices. For example, clockwise and anticlockwise orderings of three vertices determine the two orientations of a triangle. Two simplexes, $\sigma_1$ and $\sigma_2$, defined on the same vertices are similarly oriented if their vertex orderings differ by an even permutation; otherwise, they are dissimilarly oriented.

Algebraic topology provides tools for computing with simplicial complexes. A k-chain is a formal sum of oriented k-simplexes in K with coefficients in $\mathbb{Z}$. The set of all k-chains of the simplicial complex K, together with the addition operation, forms a free Abelian group $C_k(K)$, called the chain group. To link chain groups of different dimensions, the k-boundary operator, $\partial_k : C_k(K) \to C_{k-1}(K)$, maps a k-chain, written as a linear combination of k-simplexes, to the same linear combination of the boundaries of those k-simplexes.
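As a simple example, when the k-chain consists of a single oriented k-simplex spanned by k + 1 vertices as defined in Eq. (4), its boundary is the formal sum of all its (k − 1)-faces with alternating signs:

$$\partial_k \sigma^k = \sum_{i=0}^{k} (-1)^i \sigma^{k-1}_i,$$

where $\sigma^{k-1}_i = [v_0, \ldots, \hat{v}_i, \ldots, v_k]$ is the (k − 1)-simplex obtained by omitting the vertex $v_i$. The most important topological property is that a boundary has no boundary:

$$\partial_{k-1} \partial_k = 0.$$

A sequence of chain groups connected by boundary operators defines the chain complex:

$$\cdots \xrightarrow{\partial_{n+1}} C_n(K) \xrightarrow{\partial_n} C_{n-1}(K) \xrightarrow{\partial_{n-1}} \cdots \xrightarrow{\partial_1} C_0(K) \xrightarrow{\partial_0} 0. \tag{8}$$

When n exceeds the dimension of K, $C_n(K)$ is the trivial (zero) group and the corresponding boundary operator is a zero map.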
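The boundary operator and the property that a boundary has no boundary can be checked numerically. The following is a minimal sketch that assumes simplexes are stored as sorted vertex tuples with the orientation induced by vertex order. The helper name `boundary_matrix` is hypothetical, and real coefficients are used in place of integer ones for convenience.

```python
from itertools import combinations
import numpy as np

def boundary_matrix(k_simplexes, km1_simplexes):
    """Matrix of the k-boundary operator from k-chains to (k-1)-chains.

    Columns are indexed by k-simplexes and rows by (k-1)-simplexes."""
    row = {s: i for i, s in enumerate(km1_simplexes)}
    B = np.zeros((len(km1_simplexes), len(k_simplexes)), dtype=int)
    for j, s in enumerate(k_simplexes):
        for i in range(len(s)):
            face = s[:i] + s[i + 1:]      # omit the i-th vertex
            B[row[face], j] = (-1) ** i   # alternating sign
    return B

# Filled tetrahedron on vertices 0..3
vertices = list(combinations(range(4), 1))
edges = list(combinations(range(4), 2))
triangles = list(combinations(range(4), 3))
tetrahedra = list(combinations(range(4), 4))

B1 = boundary_matrix(edges, vertices)
B2 = boundary_matrix(triangles, edges)
B3 = boundary_matrix(tetrahedra, triangles)

# Composing two consecutive boundary operators gives the zero map
print(not (B1 @ B2).any(), not (B2 @ B3).any())
```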

Filtration for multiscale chain complexes
Filtration is a process that constructs a nested sequence of simplicial complexes, allowing a multiscale analysis of the point cloud. It creates a family of simplicial complexes ordered by inclusion (Figure 2c):

$$\{K_t\}_{t=0}^{m} = \{\emptyset = K_0 \subseteq K_1 \subseteq K_2 \subseteq \cdots \subseteq K_m = K\},$$

where K is the largest simplicial complex that can be obtained from the point cloud.
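The filtration induces a sequence of chain complexes, one for each filtration step t:

$$\cdots \xrightarrow{\partial^{t}_{k+2}} C^{t}_{k+1} \xrightarrow{\partial^{t}_{k+1}} C^{t}_{k} \xrightarrow{\partial^{t}_{k}} \cdots \xrightarrow{\partial^{t}_{1}} C^{t}_{0} \xrightarrow{\partial^{t}_{0}} 0, \qquad C^{t}_{k} \subseteq C^{t+1}_{k},$$

where $C^t_k := C_k(K_t)$ is the chain group for the subcomplex $K_t$, and its k-boundary operator is $\partial^t_k : C_k(K_t) \to C_{k-1}(K_t)$. Associated with the k-boundary operator, its adjoint operator is the k-adjoint boundary operator, $\partial^{t*}_k : C_{k-1}(K_t) \to C_k(K_t)$, which plays the role of the co-boundary operator. Various simplicial complexes can be used to construct the filtration, such as the Rips complex, the Čech complex, and the Alpha complex. For example, the Rips complex of K with radius t consists of all simplexes with diameter at most 2t:

$$\mathcal{R}_t(K) = \{\sigma \subseteq K \mid \operatorname{diam}(\sigma) \le 2t\}, \tag{11}$$

where $\operatorname{diam}(\sigma)$ is the largest pairwise distance between the vertices of $\sigma$.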
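A Rips filtration can be sketched in a few lines of Python. This is a brute-force illustration that assumes a small point cloud and Euclidean distance; it enumerates every vertex subset up to a chosen dimension, which is only practical for toy inputs. The function name `rips_complex` and the parameter `max_dim` are hypothetical.

```python
from itertools import combinations
import numpy as np

def rips_complex(points, t, max_dim=2):
    """All simplexes (sorted vertex tuples) with diameter <= 2t, up to max_dim."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    simplexes = []
    for k in range(max_dim + 1):
        for s in combinations(range(n), k + 1):
            # The diameter is the largest pairwise distance (0 for one vertex)
            if all(dist[a, b] <= 2 * t for a, b in combinations(s, 2)):
                simplexes.append(s)
    return simplexes

# Corners of the unit square: side length 1, diagonal sqrt(2)
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
K0 = rips_complex(square, 0.0)    # vertices only
K1 = rips_complex(square, 0.5)    # the four sides appear
K2 = rips_complex(square, 0.75)   # diagonals and triangles appear
print(len(K0), len(K1), len(K2))
```

Increasing t only ever adds simplexes, so the three complexes are nested as a filtration requires.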

Homology group and persistent homology
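With the chain complex defined in Eq. (8), the k-cycle and k-boundary groups are defined as:

$$Z_k = \ker \partial_k = \{c \in C_k \mid \partial_k c = 0\}, \qquad B_k = \operatorname{im} \partial_{k+1} = \{\partial_{k+1} c \mid c \in C_{k+1}\}.$$

Because a boundary has no boundary, $B_k \subseteq Z_k$, and the k-th homology group $H_k$ is defined as the quotient

$$H_k = Z_k / B_k.$$

The k-th Betti number, $\beta_k$, is defined as the rank of the k-th homology group $H_k$ and counts k-dimensional holes. For example, $\beta_0 = \operatorname{rank}(H_0)$ reflects the number of connected components, $\beta_1 = \operatorname{rank}(H_1)$ reflects the number of loops, and $\beta_2 = \operatorname{rank}(H_2)$ reveals the number of voids or cavities. Persistent homology is devised to track the multiscale topological information along the filtration [1]. The inclusion $K_t \subseteq K_{t+p}$ induces a homomorphism from $H_k(K_t)$ to $H_k(K_{t+p})$, whose image is the p-persistent k-th homology group:

$$H^{t,p}_k = Z^t_k / (B^{t+p}_k \cap Z^t_k),$$

where $Z^t_k = \ker \partial^t_k$ and $B^{t+p}_k = \operatorname{im} \partial^{t+p}_{k+1}$. Intuitively, this homology group records the k-dimensional homology classes of $K_t$ that persist at least until $K_{t+p}$. The birth and death of homology classes can be represented by a barcode, a set of intervals (Figure 2d).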
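Betti numbers can be computed from the ranks of boundary matrices, since the rank of $H_k$ equals $\dim C_k - \operatorname{rank} \partial_k - \operatorname{rank} \partial_{k+1}$. The sketch below assumes real coefficients (which give the same Betti numbers as the rank of the integer homology group) and a complex stored as sorted vertex tuples. The names `boundary_matrix` and `betti_numbers` are hypothetical.

```python
from itertools import combinations
import numpy as np

def boundary_matrix(k_simplexes, km1_simplexes):
    """Matrix of the k-boundary operator; simplexes are sorted vertex tuples."""
    row = {s: i for i, s in enumerate(km1_simplexes)}
    B = np.zeros((len(km1_simplexes), len(k_simplexes)))
    for j, s in enumerate(k_simplexes):
        for i in range(len(s)):
            B[row[s[:i] + s[i + 1:]], j] = (-1) ** i  # omit the i-th vertex
    return B

def betti_numbers(simplexes):
    """Betti numbers beta_0, ..., beta_top of a simplicial complex."""
    by_dim = {}
    for s in simplexes:
        by_dim.setdefault(len(s) - 1, []).append(tuple(s))
    top = max(by_dim)
    # The 0-boundary and (top+1)-boundary operators are zero maps
    rank = {0: 0, top + 1: 0}
    for k in range(1, top + 1):
        B = boundary_matrix(by_dim[k], by_dim[k - 1])
        rank[k] = int(np.linalg.matrix_rank(B))
    return [len(by_dim[k]) - rank[k] - rank[k + 1] for k in range(top + 1)]

# Vertices and edges of a triangle (one loop), and the surface of a
# tetrahedron (one enclosed cavity)
hollow_triangle = [s for d in (1, 2) for s in combinations(range(3), d)]
hollow_tetrahedron = [s for d in (1, 2, 3) for s in combinations(range(4), d)]
print(betti_numbers(hollow_triangle))
print(betti_numbers(hollow_tetrahedron))
```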

Combinatorial Laplacian
For the k-boundary operator $\partial_k : C_k \to C_{k-1}$ in K, let $B_k$ be the matrix representation of this operator relative to the standard bases of $C_k$ and $C_{k-1}$ (this matrix $B_k$ should not be confused with the k-boundary group). Associated with the boundary operator $\partial_k$, the adjoint boundary operator is $\partial^*_k : C_{k-1} \to C_k$, whose matrix representation with respect to the same ordered bases is the transpose, $B^T_k$.
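The k-combinatorial Laplacian, a topological Laplacian, is a linear operator $\Delta_k : C_k(K) \to C_k(K)$,

$$\Delta_k = \partial_{k+1} \partial^*_{k+1} + \partial^*_k \partial_k,$$

and its matrix representation, $L_k$, is given by

$$L_k = B_{k+1} B^T_{k+1} + B^T_k B_k.$$

In particular, the 0-combinatorial Laplacian (i.e., the graph Laplacian) is given as follows, since $\partial_0$ is a zero map:

$$L_0 = B_1 B^T_1.$$

For k > 0, the elements of the k-combinatorial Laplacian matrix are

$$(L_k)_{ij} = \begin{cases} \deg(\sigma^k_i), & \text{if } i = j, \\ 1, & \text{if } i \ne j,\ \sigma^k_i \nsim_U \sigma^k_j \text{ and } \sigma^k_i \sim_L \sigma^k_j \text{ with similar orientation}, \\ -1, & \text{if } i \ne j,\ \sigma^k_i \nsim_U \sigma^k_j \text{ and } \sigma^k_i \sim_L \sigma^k_j \text{ with dissimilar orientation}, \\ 0, & \text{if } i \ne j \text{ and either } \sigma^k_i \sim_U \sigma^k_j \text{ or } \sigma^k_i \nsim_L \sigma^k_j. \end{cases}$$

For k = 0, the graph Laplacian matrix $L_0$ is

$$(L_0)_{ij} = \begin{cases} \deg(\sigma^0_i), & \text{if } i = j, \\ -1, & \text{if } \sigma^0_i \sim_U \sigma^0_j, \\ 0, & \text{otherwise}. \end{cases}$$

The multiplicity of the zero eigenvalue of $L_k$ gives the k-th Betti number, according to the combinatorial Hodge theorem [2]:

$$\beta_k = \dim \ker L_k = \dim(L_k) - \operatorname{rank}(L_k).$$

The Betti numbers describe topological invariants. Specifically, $\beta_0$, $\beta_1$, and $\beta_2$ may be regarded as the numbers of independent components, rings, and cavities, respectively.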
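The relation between the zero eigenvalues of the combinatorial Laplacian and the Betti numbers can be illustrated on a small complex. The following sketch assumes a hand-built example (a filled triangle glued to a hollow one) and real coefficients. The names `boundary_matrix`, `combinatorial_laplacian`, and `zero_multiplicity`, and the tolerance `tol`, are hypothetical choices.

```python
import numpy as np

def boundary_matrix(k_simplexes, km1_simplexes):
    """Matrix of the k-boundary operator; simplexes are sorted vertex tuples."""
    row = {s: i for i, s in enumerate(km1_simplexes)}
    B = np.zeros((len(km1_simplexes), len(k_simplexes)))
    for j, s in enumerate(k_simplexes):
        for i in range(len(s)):
            B[row[s[:i] + s[i + 1:]], j] = (-1) ** i  # omit the i-th vertex
    return B

# Filled triangle (0, 1, 2) sharing the edge (1, 2) with a hollow triangle (1, 2, 3)
vertices = [(0,), (1,), (2,), (3,)]
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
triangles = [(0, 1, 2)]

B = {
    0: np.zeros((0, len(vertices))),    # the 0-boundary operator is a zero map
    1: boundary_matrix(edges, vertices),
    2: boundary_matrix(triangles, edges),
    3: np.zeros((len(triangles), 0)),   # there are no 3-simplexes
}

def combinatorial_laplacian(k):
    """L_k = B_{k+1} B_{k+1}^T + B_k^T B_k."""
    return B[k + 1] @ B[k + 1].T + B[k].T @ B[k]

def zero_multiplicity(L, tol=1e-9):
    """Number of eigenvalues of the symmetric matrix L that are numerically zero."""
    return int(np.sum(np.abs(np.linalg.eigvalsh(L)) < tol))

L0, L1, L2 = (combinatorial_laplacian(k) for k in range(3))
betti = [zero_multiplicity(L) for L in (L0, L1, L2)]
print(np.diag(L0))   # vertex degrees on the diagonal of the graph Laplacian
print(betti)
```

The diagonal of `L1` equals the upper degree of each edge plus 2, in line with the degree formula for k = 1.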

Persistent spectral graph (PSG)
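Small increments of the filtration parameter can produce changes in homotopic shape that are sensitive to noise in the data. Persistence can be taken into account to make the Laplacian more robust. First, we define the p-persistent chain group $C^{t,p}_k \subseteq C^{t+p}_k$, whose boundary is in $C^t_{k-1}$:

$$C^{t,p}_k = \{\alpha \in C^{t+p}_k \mid \partial^{t+p}_k(\alpha) \in C^t_{k-1}\},$$

where $\partial^{t+p}_k : C^{t+p}_k \to C^{t+p}_{k-1}$ is the k-boundary operator for the chain group $C^{t+p}_k$. Then we can define a p-persistent boundary operator, $\eth^{t,p}_k$, as the restriction of $\partial^{t+p}_k$ to the p-persistent chain group $C^{t,p}_k$:

$$\eth^{t,p}_k = \partial^{t+p}_k \big|_{C^{t,p}_k} : C^{t,p}_k \to C^t_{k-1}.$$

PSG then defines a family of p-persistent k-combinatorial Laplacian operators $\Delta^{t,p}_k : C_k(K_t) \to C_k(K_t)$ [3,4]:

$$\Delta^{t,p}_k = \eth^{t,p}_{k+1} \left(\eth^{t,p}_{k+1}\right)^* + \left(\partial^t_k\right)^* \partial^t_k.$$

We denote $B^{t,p}_{k+1}$ and $B^t_k$ as the matrix representations of the boundary operators $\eth^{t,p}_{k+1}$ and $\partial^t_k$, respectively. Then the Laplacian matrix for $\Delta^{t,p}_k$ is

$$L^{t,p}_k = B^{t,p}_{k+1} \left(B^{t,p}_{k+1}\right)^T + \left(B^t_k\right)^T B^t_k.$$

The multiplicity of the zero eigenvalue of $L^{t,p}_k$, i.e., the harmonic part of the spectrum, gives the p-persistent k-th Betti number. In addition, the rest of the spectra, i.e., the non-harmonic part, capture additional geometric information. The family of spectra of the persistent Laplacians reveals the homotopic shape evolution [5].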
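One possible way to assemble the persistent Laplacian matrix numerically is sketched below. It is an illustration under stated assumptions, not the implementation of [3,4]: the persistent chain group is obtained as the null space of the boundary-matrix rows indexed by k-simplexes absent from $K_t$, and an orthonormal basis of that null space (from an SVD) is used so that the adjoint is the plain transpose. All function names and the tolerances are hypothetical.

```python
import numpy as np

def boundary_matrix(k_simplexes, km1_simplexes):
    """Matrix of the k-boundary operator; simplexes are sorted vertex tuples."""
    row = {s: i for i, s in enumerate(km1_simplexes)}
    B = np.zeros((len(km1_simplexes), len(k_simplexes)))
    if not km1_simplexes:                 # the 0-boundary operator is a zero map
        return B
    for j, s in enumerate(k_simplexes):
        for i in range(len(s)):
            B[row[s[:i] + s[i + 1:]], j] = (-1) ** i
    return B

def null_space(A, tol=1e-10):
    """Orthonormal basis (as columns) of the null space of A."""
    if A.size == 0:
        return np.eye(A.shape[1])
    _, s, vt = np.linalg.svd(A)
    return vt[int(np.sum(s > tol)):].T

def persistent_laplacian(K_t, K_tp, k):
    """Matrix of the p-persistent k-Laplacian for complexes K_t inside K_tp."""
    def of_dim(K, d):
        return sorted(s for s in K if len(s) == d + 1)
    sk_t = of_dim(K_t, k)
    in_t = set(sk_t)
    extra = [s for s in of_dim(K_tp, k) if s not in in_t]
    # Rows: k-simplexes of K_t first, then those only in K_tp
    B_big = boundary_matrix(of_dim(K_tp, k + 1), sk_t + extra)
    top, bottom = B_big[:len(sk_t)], B_big[len(sk_t):]
    # Chains whose boundary avoids the extra simplexes form the persistent chain group
    B_pers = top @ null_space(bottom)
    B_down = boundary_matrix(sk_t, of_dim(K_t, k - 1))
    return B_pers @ B_pers.T + B_down.T @ B_down

def persistent_betti(K_t, K_tp, k, tol=1e-9):
    eigenvalues = np.linalg.eigvalsh(persistent_laplacian(K_t, K_tp, k))
    return int(np.sum(np.abs(eigenvalues) < tol))

# Hollow square, then the same square with a diagonal and one or two triangles
K_t = [(0,), (1,), (2,), (3,), (0, 1), (1, 2), (2, 3), (0, 3)]
K_half = K_t + [(0, 2), (0, 1, 2)]
K_full = K_half + [(0, 2, 3)]
for K_tp in (K_t, K_half, K_full):
    print(persistent_betti(K_t, K_tp, 1))
```

In this toy filtration the loop of the hollow square survives while only one triangle is present and disappears once both triangles fill the square, which the count of zero eigenvalues tracks.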